Appendix

This appendix contains additional technical information that may be useful but is not required in order to use Myallo.

Search Site Definitions

Myallo has a few built-in search site definitions, but you can create additional ones. Third parties might also create search site definition files that you can download and use.

An additional search site definition resides in its own file. The file name has an extension of ".searchsite". The file contains text in a standard property list format, but the easiest way to create, edit and delete these files is by using Myallo itself, as part of the Preferences command.

Myallo places newly created search site files in "~/Library/Application Support/Myallo/Search Sites/" where "~" is the path to your user directory. Files can also be placed in "/Library/Application Support/Myallo/Search Sites/", which makes them accessible to all users on the computer.

To create a new search site definition, you should be familiar with URLs, the search site, and HTML coding.

When Myallo accesses a search site, it creates a URL using the search site definition and the search term. This URL must return text that contains URLs that point to articles that match the search term. The search site definition tells Myallo how to construct the URL and how to identify the result URLs. Many, but by no means all, Internet search engines can conform to this mechanism.

For example, suppose an Internet search engine can be accessed with a URL such as this:

http://www.samplesearchengine.com/query.cgi?q="searchterm"

and that submitting this URL results in a web page that contains links to several web pages that contain "searchterm". Such an engine can be used in Myallo.

Any entity whose URL returns a list of URLs can be used as a search site. For example, it would be unusual, but even a URL that points to a plain text file containing some URLs can be used.

When you open the Preferences window, click the Search Sites tab, and click the "+" button, a dialog for creating a new search site appears.

The Name and Comment fields identify the search site and appear in the dialog's list of search sites. The new search site file is named after the contents of the Name field, with the extension ".searchsite". The name is required, and the comment is optional.

Myallo constructs the search URL by concatenating the Location field, Prefix field, search term, and Suffix field.

The Location field must contain a valid URL for the search engine. In our example, it would contain "http://www.samplesearchengine.com".

The Prefix contains the part of the URL that is between the Location and the search term. It can be blank if necessary. In our example it would contain "/query.cgi?q="" (including the quote that comes before the search term).

Myallo supplies the search terms, and always places them between the Prefix and Suffix.

The Suffix contains the rest of the URL beyond the search term. In our example, it would contain """ (a single double-quote character).
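The concatenation can be sketched in a few lines of Python. This is only an illustration of the rule described above, not Myallo's own code. The function name build_search_url is hypothetical, and the sketch does no escaping of the search term because this guide does not say whether Myallo does any.

```python
def build_search_url(location, prefix, term, suffix):
    """Join the four pieces in the order the guide describes."""
    return location + prefix + term + suffix


url = build_search_url(
    "http://www.samplesearchengine.com",  # Location field
    '/query.cgi?q="',                     # Prefix field, ending with the opening quote
    "searchterm",                         # search term, supplied by Myallo
    '"',                                  # Suffix field, the closing quote
)
print(url)
```

If the engine does not need anything after the search term, the Suffix is simply an empty string.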

The final four fields help Myallo locate the URLs in the returned page that contains the results. They are optional and can be empty, in which case Myallo will consider every URL in the returned page as a valid result. But on a typical HTML page returned by a search engine, there can be headers, footers, sidebars, advertisements, images and all sorts of things which may contain URLs that are not part of the results proper. Therefore these four fields are used to filter non-result URLs out. Determining the correct strings to place in these fields is usually the trickiest part of defining a search site.

Myallo reads the text returned by the URL, which is typically going to be an HTML page. It then examines the HTML source to find URLs that represent results. Suppose in our example search engine, the URL we created returns the following HTML page:

<HTML><HEAD>
<TITLE>Search Results</TITLE>
</HEAD><BODY>
Welcome to Sample Search Engine!
<a href="http://www.samplesearchengine.com">Home</a>
Here are the results of your search:<br>
<a href="http://www.termmz.com">Search Terms Corp.</a>
<a href="http://www.lookin4u.com">Lookin for U</a>
<br>
<b>Enjoy</b> searching!
</BODY></HTML>

This very simple example contains three URLs: one to the search engine's home page, and two results from the search. In the real world, the returned page would probably be much more complex. Here, we want to grab the two result URLs and discard the home page URL.

If we left all four fields blank, the whole page would be scanned for URLs and all three would be considered results.

The List Start field defines the beginning of the section of the page which contains results. Myallo scans the page from the beginning, looking for the first occurrence of the string in List Start. If not found, it assumes the list starts from the beginning of the page, but if it is found, everything from the start of the page to the end of the List Start string is discarded. All searches are case-insensitive.

In our example, if we put "search:" in List Start, everything up to there would be discarded, including the first URL. That would be enough to let Myallo successfully parse the results. (In this case using "<br>" would also work, as would anything we knew uniquely came after the URLs that are before the list of results.) However, we could do better defining this site, and in the real world, it is extremely likely we would have to.

The List End field defines the end of the section of the page which contains results. If it is blank, the list is assumed to end at the end of the page. Myallo scans the page forward from the List Start point until it finds the first occurrence of the List End string. In the example, "<b>Enjoy</b>" would work. So would "<b>" alone, assuming we knew the engine would never return a boldface word in the midst of its results.

Once it has defined the section of the page containing the list, Myallo gathers result URLs, taking Item Start and Item End into account. It starts gathering by scanning forward from the list start, and stops when it gets to the list end.

If Item Start is nonblank, Myallo searches for an occurrence of it, ignoring all it passes over. If it is not found, the gathering process ends. If Item Start is blank, it doesn't do this step.

Next, it scans for the first URL it finds. This is taken as a result URL. If it doesn't find a URL, the gathering process ends.

Then, it scans for an occurrence of Item End, ignoring all it passes over. If it is not found, the gathering process ends. If Item End is blank, it doesn't do this step.

All the URLs gathered are considered to be results of the search. In our example, we might set Item Start to "a href=" and Item End to "</a>". This would isolate the result URLs. In more complex examples, the Item Start and Item End strings can be very handy, as there may be additional or multiple links surrounding each search result.
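The gathering steps above can be sketched in Python. This is a reading of the description in this appendix, not Myallo's actual code. The names gather_results, find_ci and URL_RE are hypothetical. Two points are assumptions because the guide does not cover them: how a URL is recognized (here, "http://" or "https://" up to the next quote, angle bracket or whitespace), and what happens when a nonblank List End is not found (here, the list runs to the end of the page).

```python
import re

# Assumption: a URL runs from "http://" or "https://" to the next quote,
# angle bracket, or whitespace. The guide does not say how Myallo spots URLs.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def find_ci(text, needle, start, end):
    """Case-insensitive search for needle in text[start:end]; Match or None."""
    return re.compile(re.escape(needle), re.IGNORECASE).search(text, start, end)


def gather_results(page, list_start="", list_end="", item_start="", item_end=""):
    # Narrow the page to the list section.
    begin, stop = 0, len(page)
    if list_start:
        m = find_ci(page, list_start, 0, len(page))
        if m:
            begin = m.end()   # discard everything through the List Start string
    if list_end:
        m = find_ci(page, list_end, begin, len(page))
        if m:
            stop = m.start()  # assumption: if not found, run to end of page

    results, pos = [], begin
    while True:
        if item_start:        # skip ahead to the next Item Start, if one is set
            m = find_ci(page, item_start, pos, stop)
            if m is None:
                break
            pos = m.end()
        m = URL_RE.search(page, pos, stop)  # the first URL found is a result
        if m is None:
            break
        results.append(m.group())
        pos = m.end()
        if item_end:          # skip ahead past the Item End, if one is set
            m = find_ci(page, item_end, pos, stop)
            if m is None:
                break
            pos = m.end()
    return results


SAMPLE_PAGE = """<HTML><HEAD>
<TITLE>Search Results</TITLE>
</HEAD><BODY>
Welcome to Sample Search Engine!
<a href="http://www.samplesearchengine.com">Home</a>
Here are the results of your search:<br>
<a href="http://www.termmz.com">Search Terms Corp.</a>
<a href="http://www.lookin4u.com">Lookin for U</a>
<br>
<b>Enjoy</b> searching!
</BODY></HTML>"""

everything = gather_results(SAMPLE_PAGE)
filtered = gather_results(SAMPLE_PAGE, "search:", "<b>Enjoy</b>", "a href=", "</a>")
print(everything)
print(filtered)
```

With all four fields blank the sketch returns all three URLs, home page included. With the field values from the example it returns only the two result URLs.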

Myallo Search Sites

Starting with Myallo version 1.2, a special URL scheme is supported. Currently this is used to implement Spotlight searches when running under Mac OS X 10.4 or above. A Spotlight search site that uses this facility is built into Myallo 1.2 and above.

Currently only one URL format is supported. The search site URL must be "x-myallo:///spotlight/content/?", which indicates that a Spotlight search of local file text content is to be performed. The last six fields in the search dialog should remain empty. Spotlight searches are explained in the main part of this guide.

Advanced Search Strings

Normally, the name of an interest item is used as the search term, but you can instead specify an advanced search string, which allows pattern matching. To set the advanced search string, select the interest, and open the details drawer with the Show Details command. The drawer contains the advanced search string field.

Advanced search strings use a special matching mechanism which treats the string as a "regular expression", or regex "pattern". Such patterns are used in computer languages such as Perl, and several books and online references are available. This appendix describes the basic rules.

Literal expressions

In the simplest cases, a pattern is just a literal string that must match exactly. For example, the pattern:

regexp


matches the string "regexp" and no others. That pattern would match in the same way as if the interest was named "regexp" and the advanced search string were left empty.

Some characters have a special meaning when they occur in a pattern. They aren't matched literally as in the previous example, but instead denote a more general pattern. For example, the character * is used to indicate that the preceding element of a pattern may be repeated 0, 1, or more times. In the pattern:

smooo*th


the * indicates that the preceding o can be repeated 0 or more times. So the pattern matches:

smooth
smoooth
smooooth
smoooooth
...


Suppose you want to write a pattern that literally matches a special character like * -- in other words, you don't want * to indicate a permissible repetition, but to match * literally. This is accomplished by quoting the special character with a backslash. The pattern:

smoo\*th


matches the string:

smoo*th


and no other strings.
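If you want to experiment with these two patterns outside Myallo, a short Python script can serve as a rough test bench. Python's re module is only a stand-in here: this guide does not say which regex engine Myallo uses, and the two may differ in details. The list of candidate strings is made up for the illustration, and re.fullmatch is used so that a string counts only if the whole string fits the pattern.

```python
import re

candidates = ["smoth", "smooth", "smoooth", "smooooth", "smoo*th"]

# Unquoted *: the o before it may appear zero or more times.
star = [s for s in candidates if re.fullmatch(r"smooo*th", s)]

# Quoted \*: the * must appear literally.
escaped = [s for s in candidates if re.fullmatch(r"smoo\*th", s)]

print(star)
print(escaped)
```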

Other characters which have special meaning in a pattern include

+ ? | ( ) { and }

Character sets


. matches any character. For example:

p.ck


matches

pick
pack
puck
pbck
pcck
p.ck

...


[ begins a character set. A character set is similar to . in that it matches not a single, literal character, but any of a set of characters. [ is different from . in that with [, you define the set of characters explicitly.

There are three basic forms a character set can take.

In the first form, the character set is spelled out:

[<cset-spec>] -- every character in <cset-spec> is in the set.


In the second form, the character set is the negation of a set that is explicitly spelled out:

[^<cset-spec>] -- every character *not* in <cset-spec> is in the set.


A <cset-spec> is more or less an explicit enumeration of a set of characters. It can be written as a string of individual characters:

[aeiou]


or as a range of characters:

[0-9]


These two forms can be mixed:

[A-Za-z0-9_$]


Note that special characters (such as *) are not special within a character set. The hyphen (-) is special, as illustrated above, except when it is the first character mentioned, as illustrated below.

This is a four-character set:

[-+\*]


The third form of a character set makes use of a pre-defined "character class":

[[:class-name:]] -- every character described by class-name is in the set.


The supported character classes are:

alnum - the set of alpha-numeric characters
alpha - the set of alphabetic characters
blank - tab and space
cntrl - the control characters
digit - decimal digits
graph - all printable characters except space
lower - lower case letters
print - the "printable" characters
punct - punctuation
space - whitespace characters
upper - upper case letters
xdigit - hexadecimal digits


Finally, character class sets can also be inverted:

[^[:space:]] - all non-whitespace characters


Character sets can be used in a regular expression anywhere a literal character can.
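The first two forms can be tried out in Python as a rough check. Python's re module is an assumption used for illustration only, not necessarily the engine Myallo uses. In particular, Python does not support the [[:class-name:]] form, so it is left out of this sketch. The word lists are made up.

```python
import re

words = ["pick", "pack", "puck", "peck", "p.ck", "pck"]

# . matches any single character, so "pck" (nothing between p and c) fails.
any_char = [w for w in words if re.fullmatch(r"p.ck", w)]

# [aeiou] matches exactly one character from the listed set.
vowel = [w for w in words if re.fullmatch(r"p[aeiou]ck", w)]

# [^aeiou] matches exactly one character that is NOT in the listed set.
non_vowel = [w for w in words if re.fullmatch(r"p[^aeiou]ck", w)]

# Ranges and single characters mixed in one set, repeated with +.
names = ["total_2", "price$", "two words", "a-b"]
identifier = [n for n in names if re.fullmatch(r"[A-Za-z0-9_$]+", n)]

print(any_char)
print(vowel)
print(non_vowel)
print(identifier)
```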

Subexpressions

A subexpression is a regular expression enclosed in ( and ). A subexpression can be used anywhere a single character or character set can be used.

Subexpressions are useful for grouping pattern constructs. For example, the repeat operator, *, usually applies to just the preceding character. Recall that:

smooo*th


matches

smooth
smoooth
...


Using a subexpression, we can apply * to a longer string:

banan(an)*a


matches

banana
bananana
banananana
...


Repeated Subexpressions

* is the repeat operator. It applies to the preceding character, character set, or subexpression. It indicates that the preceding element can be matched 0 or more times:

bana(na)*


matches

bana
banana
bananana
banananana
...


+ is similar to * except that + requires the preceding element to be matched at least once. So while:

bana(na)*


matches

bana


bana(na)+


does not. Both match

banana
bananana
banananana
...


Thus, bana(na)+ is short-hand for banana(na)*.
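The difference between * and +, and the short-hand claim, can be checked with a small Python sketch. Python's re module is a stand-in for illustration and may differ from Myallo's engine in other respects. The candidate list is made up.

```python
import re

candidates = ["bana", "banana", "bananana", "banananana"]

star = [s for s in candidates if re.fullmatch(r"bana(na)*", s)]        # zero or more "na"
plus = [s for s in candidates if re.fullmatch(r"bana(na)+", s)]        # at least one "na"
longhand = [s for s in candidates if re.fullmatch(r"banana(na)*", s)]  # spelled out

print(star)
print(plus)
print(plus == longhand)
```

Only the + version rejects "bana", and the + pattern and its spelled-out equivalent accept the same strings from the list.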

Optional Subexpressions

? indicates that the preceding character, character set, or subexpression is optional. It is permitted to match, or to be skipped:

CSNY?


matches both

CSN


and

CSNY
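A quick check in Python, used here only as a stand-in for whatever engine Myallo uses, shows that ? makes just the preceding element optional. The candidate strings are made up.

```python
import re

candidates = ["CS", "CSN", "CSNY", "CSNYY"]

# Only the Y is optional; C, S and N are still required.
optional = [s for s in candidates if re.fullmatch(r"CSNY?", s)]

print(optional)
```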

Counted Subexpressions

An interval expression, {m,n} where m and n are non-negative integers with n >= m, applies to the preceding character, character set, subexpression or backreference. It indicates that the preceding element must match at least m times and may match as many as n times.

For example:

c([ad]){1,4}r


matches

car
cdr
caar
cdar
...
caaar
cdaar
...
cadddr
cddddr
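The lower and upper bounds can be checked with a Python sketch, using the pattern c([ad]){1,4}r (with a trailing r, so that words such as car and cdr match). Python's re module is an assumption for illustration, not necessarily Myallo's engine, and the candidate list is made up.

```python
import re

pattern = re.compile(r"c([ad]){1,4}r")

candidates = ["cr", "car", "cdr", "cadr", "cddddr", "cdddddr"]

# "cr" has no a or d at all, and "cdddddr" has five, so both fall outside {1,4}.
counted = [s for s in candidates if pattern.fullmatch(s)]

print(counted)
```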

Alternative Subexpressions

An alternative is written:

regexp-1|regexp-2|regexp-3|...


It matches anything matched by some regexp-n. For example:

Crosby, Stills, (and Nash|Nash, and Young)


matches

Crosby, Stills, and Nash


and

Crosby, Stills, Nash, and Young
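The role of the parentheses in this example can be seen in a Python sketch. Python's re module is a stand-in for illustration, and Myallo's engine may differ in details. The candidate strings are made up. Without the parentheses, | splits the whole pattern in two, which is usually not what is wanted.

```python
import re

candidates = [
    "Crosby, Stills, and Nash",
    "Crosby, Stills, Nash, and Young",
    "Crosby, Stills, and Young",
    "Nash, and Young",
]

# The alternatives are limited to the part inside the parentheses.
grouped = [s for s in candidates
           if re.fullmatch(r"Crosby, Stills, (and Nash|Nash, and Young)", s)]

# No parentheses: the choice is between the entire left and right halves.
ungrouped = [s for s in candidates
             if re.fullmatch(r"Crosby, Stills, and Nash|Nash, and Young", s)]

print(grouped)
print(ungrouped)
```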